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Summary 


This  paper  presents  a  comprehensive  review  of  the  research 
literature  on  an  aspect  of  probability  assessment  called 
"calibration."  Calibration  measures  the  validity  of  probability 
assessments.  Being  well-calibrated  is  critical  for  optimal 
decision-making  and  for  the  development  of  decision-aiding 
techniques . 

Subjective  probability  assessments  play  a  key  role  in 
decision  making.  It  is  often  necessary  to  rely  on  an  expert 
to  assess  the  probability  of  some  future  event.  How  good  are 
such  assessments?  One  important  aspect  of  their  quality  is 
called  calibration.  Formally,  an  assessor  is  calibrated  if, 
over  the  long  run,  for  all  statements  assigned  a  given 
probability  (e.g. ,  the  probability  is  .65  that  "Romania  will 
maintain  its  current  relation  with  People's  China  for  the  next 
six  months."),  the  proportion  that  is  true  is  equal  to  the 
probability  assigned.  For  example,  if  you  are  well  calibrated, 
then  across  all  the  many  occasions  that  you  assign  a  probability 
of  .8,  in  the  long  run  80%  of  them  should  turn  out  to  be  true. 

If,  instead,  only  70%  are  true,  you  are  not  well  calibrated, 
you  are  over conf iden t .  If  95%  of  them  are  true,  you  are 
underconfident . 

While  this  characteristic  of  assessors  has  obvious 
importance  for  applied  situations,  people's  calibration  has 
rarely  been  discussed  by  decision  analysts  or  decision  advisors. 

In  the  last  few  years,  there  has  developed  an  extensive 
literature  about  calibration,  reporting  both  laboratory  and 
real-world  experiments.  It  is  now  time  to  review  this  literature, 
to  look  for  common  findings  which  can  be  used  to  improve 
decisions,  and  to  identify  unsolved  problems. 
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Findings 

Two  general  classes  of  calibration  problem  have  been 
studied.  The  first  class  is  calibration  for  events  for  which  Ij, 

the  outcome  is  discrete.  These  include  probabilities  assigned  J 

t 

to  statements  like  "I  know  the  answer  to  that  question,"  "They  , 

are  planning  an  attack,"  or  "Our  alarm  system  is  foolproof." 

For  such  tasks,  the  following  generalizations  are  justified  | 

by  the  research: 


1.  Weather  forecasters,  who  typically  have  had  several 
years  of  experience  in  assessing  probabilities,  are  quite  well 
calibrated. 

2.  Other  experiments,  using  a  wide  variety  of  tasks  and 
subjects,  show  that  people  are  generally  quite  poorly  calibrated. 
In  particular,  people  act  as  though  they  can  make  much  finer 
distinctions  in  their  degree  of  uncertainty  than  is  actually 

the  case. 

3.  Overconfidence  is  found  in  most  tasks;  that  is,  people 
tend  to  overestimate  how  much  they  know. 


4.  The  degree  of  overconfidence  untutored  assessors  show 
is  a  function  of  the  difficulty  of  the  task.  The  more  difficult 
the  task,  the  greater  the  overconfidence. 

5.  Training  can  improve  calibration  only  to  a  limited 
extent. 

The  second  class  of  tasks  is  calibration  for  probabilities 
assigned  to  uncertain  continuous  quantities.  For  example,  what 
is  the  mean  time  between  failures  for  this  system?  How  much 

ii 


i 


will  this  project  cost?  The  assessor  must  report  a  probability 
density  function  across  the  possible  values  of  such  uncertain 
quantities.  The  usual  method  for  eliciting  such  probability 
density  functions  is  to  assess  a  small  number  of  fractiles  of 
the  function.  The  .25  fractile,  for  example,  is  that  value  of 
the  uncertain  quantity  such  that  there  is  just  a  25%  chance 
that  the  true  value  will  be  smaller  than  the  specified  value. 
Suppose  we  had  a  person  assess  a  large  number  of  .25  fractiles. 

The  assessor  would  be  giving  numbers  such  that,  for  example, 

"There  is  a  25%  chance  that  this  repair  will  be  done  in  less 
than  x^  hours"  and  "There  is  a  25%  chance  that  Warsaw  Pact 
personnel  in  Czechoslovakia  number  less  than  x.. . "  This  person 
will  be  well  calibrated  if,  over  a  large  set  of  such  estimates, 
just  25%  of  the  true  values  turn  out  to  be  less  than  the  x-value 
specified  for  each  one.  The  measures  of  calibration  used  most 
frequently  in  research  consider  pairs  of  extreme  fractiles.  For 
example,  experimenters  assess  calibration  by  asking  whether  98% 
of  the  true  values  fall  between  an  assessor's  .01  and  .99 
fractiles. 

For  calibration  of  continuous  quantities,  the  following 
results  summarize  the  research. 

1.  A  nearly  universal  bias  is  found:  assessors'  probability 
density  functions  are  too  narrow.  For  example,  20  to  50%  of 

the  true  values  lie  outside  the  .01  and  .99  fractiles,  instead 
of  the  prescribed  2%.  This  bias  reflects  overconfidence;  the 
assessors  think  they  know  more  about  the  uncertain  quantities 
than  they  actually  do  know. 

2.  Some  data  from  weather  forecasters  suggests  that  they 
are  not  overconfident  in  this  task.  But  it  is  unclear  whether 
this  is  due  to  training,  experience,  special  instructions,  or 
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the  specific  uncertain  quantities  they  deal  with  (e.g., 
tomorrow’s  high  temperature). 


3.  A  few  studies  have  indicated  that,  with  practice,  people 
can  learn  to  become  somewhat  better  calibrated. 

Implications 

Since  assessed  probabilities  are  central  to  a  wide  variety 
of  decision  problems  (e.g.,  making  intelligence  estimates, 
assessing  system  reliability,  projecting  costs,  deciding  whether 
to  acquire  more  information) ,  the  question  of  whether  such 
probabilities  are  calibrated  has  far-reaching  importance. 

Almost  all  decision  analyses  involve  probability  assessments. 

If  these  assessments  are  in  error,  the  finest  analysis  relying 
on  them  may  be  faulty.  The  bias  towards  overconfidence  reported 
here  is  widespread  and  well  documented.  What  is  not  so  well 
established  is  whether,  and  how,  this  bias  can  be  overcome 
through  training.  The  superior  performance  of  weather  fore¬ 
casters  is  encouraging.  These  people  have  been  using 
probabilities  in  their  forecasts  on  a  daily  basis  for  several 
years ;  one  might  assume  that  this  experience  accounts  for 
their  excellence.  Further  research  is  needed  to  document  just 
how  much  training,  with  what  kind  of  feedback,  is  most  efficient 
for  improving  assessors'  calibration.  Such  research  is  crucial 
to  developing  a  viable  decision  analysis  technology.  It  also 
helps  tell  us  how  much  faith  to  put  in  the  probability 
assessments  and  decisions  of  untrained  decision  makers  working 
without  the  benefit  of  decision  aids. 
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CALIBRATION  OF  PROBABILITIES:  THE  STATE  OF  THE  ART  TO  1980 

Introduction 

From  the  subjectivist  point  of  view  (de  Finetti,  1937)  a 
probability  is  a  degree  of  belief  in  a  proposition,  it  expresses 
a  purely  internal  state;  there  is  no  "right,"  "correct,"  or 
"objective"  probability  residing  somewhere  "in  reality"  against 
which  one's  degree  of  belief  can  be  compared.  In  many  circum¬ 
stances,  however,  it  may  become  possible  to  verify  the  truth 
or  falsity  of  the  proposition  to  which  a  probability  was 
attached.  Today,  one  assesses  the  probability  of  the  proposition 
"it  will  rain  tomorrow."  Tomorrow,  one  looks  at  the  rain  gauge 
to  see  whether  or  not  it  has  rained.  When  possible,  such 
verification  can  be  used  to  determine  the  adequacy  of  probability 
assessments . 

Winkler  and  Murphy  (1968b)  have  identified  two  kinds  of 
"goodness"  in  probability  assessments:  normative  goodness,  which 
reflects  the  degree  to  which  assessments  express  the  assessor's 
true  beliefs  and  conform  to  the  axioms  of  probability  theory,  and 
substantive  goodness,  which  reflects  the  amount  of  knowledge  of 
the  topic  area  contained  in  the  assessments.  This  paper  reviews 
the  literature  concerning  yet  another  aspect  of  "goodness," 
called  calibration. 

If  a  person  assesses  the  probability  of  a  proposition  being 
true  as  .7  and  later  finds  that  the  proposition  is  false,  that  in 
itself  does  not  invalidate  the  assessment.  However,  if  a  judge 
assigns  .7  to  10,000  independent  propositions,  only  25  of  which 
subsequently  are  found  to  be  true,  there  is  something  wrong  with 
these  assessments.  The  attribute  that  they  lack  is  called 
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calibration;  it  has  also  been  called  realism  (Brown  &  Shuford, 
1973)  ,  external  validity  (Brown  &  Shuford,  1973) ,  realism  of 
confidence  (Adams  &  Adams,  1961),  appropriateness  of  confidence 
(Oskamp,  1962),  secondary  validity  (Murphy  &  Winkler,  1971),  and 
reliability  (Murphy,  1973).  Formally,  a  judge  is  calibrated  if, 
over  the  long  run,  for  all  propositions  assigned  a  given 
probability,  the  proportion  true  equals  the  probability  assigned. 
Judges'  calibration  can  be  empirically  evaluated  by  observing 
their  probability  assessments,  verifying  the  associated 
propositions,  and  then  observing  the  proportion  true  in  each 
response  category. 

The  experimental  literature  on  the  calibration  of  assessors 
making  probability  judgments  about  discrete  propositions  is 
reviewed  in  the  first  section  of  this  paper.  The  second  section 
looks  at  the  calibration  of  probability  density  functions 
assessed  for  uncertain  numerical  quantities.  Although 
calibration  is  essentially  a  property  of  individuals,  most  of 
the  studies  reviewed  here  have  reported  data  grouped  across 
assessors  in  order  to  secure  the  large  quantities  of  data  needed 
for  stable  estimates  of  calibration. 

Discrete  Propositions 

Discrete  propositions  can  be  characterized  according  to  the 
number  of  alternatives  they  offer; 

No  alternatives;  "What  is  absinthe?"  The  assessor  provides 
an  answer,  and  then  gives  the  probability  that  the  answer  given 
is  correct.  The  entire  range  of  probability  responses,  from  0  to 
1,  is  appropriate. 
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One  alternative:  "Absinthe  is  a  precious  stone.  What  is 
the  probability  that  this  statement  is  true?"  Again,  the 
relevant  range  of  the  probability  scale  is  0  to  1. 

Two  alternatives:  "Absinthe  is  (a)  a  precious  stone;  (b)  a 
liqueur."  With  the  half-range  method,  the  assessor  first  selects 
the  more  likely  alternative  and  then  states  the  probability  (i  .5) 
that  this  choice  is  correct.  With  the  full-range  method,  the 
subject  gives  the  probability  (from  0  to  1)  that  a  prespecified 
alternative  is  correct. 

Three  or  more  alternatives:  "Absinthe  is  (a)  a  precious 
stone;  (b)  a  liqueur;  (c)  a  Caribbean  island;  (d)  .  .  .  .  "  Two 
variations  of  this  task  may  be  used:  (1)  the  assessor  selects 
the  single  most  likely  alternative  and  states  the  probability  that 
it  is  correct,  using  a  response  £  1/k  for  k  alternatives  or 
(2)  the  assessor  assigns  probabilities  to  all  alternatives,  using 
the  range  0  to  1. 

For  all  these  variations,  calibration  may  be  reported  via  a 
calibration  curve.  Such  a  curve  is  derived  as  follows: 

(1)  Collect  many  probability  assessments  for  items  whose  correct 
answer  is  known  or  will  shortly  be  known  to  the  experimenter. 

(2)  Group  similar  assessments,  usually  within  ranges  (e.g.,  all 
assessments  between  .60  and  .69  are  placed  in  the  same  category). 

(3)  Within  each  category,  compute  the  proportion  correct  (i.e., 

the  proportion  of  items  for  which  the  proposition  is  true  or  the 
alternative  is  correct) .  (4)  For  each  category,  plot  the  mean 

response  (on  the  abscissa)  against  the  proportion  correct  (on 
the  ordinate) .  Perfect  calibration  would  be  shown  by  all  points 
falling  on  the  identity  line. 


For  half-range  tasks,  badly  calibrated  assessments  can  be 
either  overconfident ,  whereby  the  proportions  correct  are  less 
than  the  assessed  probabilities ,  so  that  the  calibration  curve 
falls  below  the  identity  line,  or  underconfident ,  whereby  the 
proportions  correct  are  greater  than  the  assessed  probabilities 
and  the  calibration  curve  lies  above  the  identity  line. 

For  full-range  tasks  with  zero  or  one  alternative, 
overconfidence  has  two  possible  meanings.  Assessors  could  be 
overconfident  in  the  truth  of  the  answer;  such  overconfidence 
would  be  indicated  by  a  calibration  curve  falling  always  below 
the  identity  line.  Alternatively,  assessors  could  be  overconfid¬ 
ent  in  their  ability  to  discriminate  true  from  false  propositions. 
Such  overconfidence  would  be  shown  by  a  calibration  curve  below 
the  identity  line  in  the  region  above  . 5  and  above  the  identity 
line  in  the  region  below  .5. 

Several  numerical  measures  of  calibration  have  been  proposed. 
Murphy  (1973)  has  explored  the  general  case  of  k-alternative 
items,  starting  with  the  Brier  score  (1950),  a  general  measure  of 
overall  goodness  of  probability  assessments  such  that  the 
smaller  the  score,  the  better.  The  Brier  score  for  N  items  is; 


B  -  fj  j  <Si  -  £i>  <Ei  -  £i>' 


where  r^  is  a  vector  of  the  assessed  probabilities  for  the  k 


alternatives  of  the  i'th  item,  r^  =  (r.^, 
associated  outcome  vector,  c^  =  (c^,  ... 


rki>- 


,  c^, 


c^  is  the 
cki)  ' 


where  c^  equals  one  for  the  true  alternative  and  zero  otherwise, 
and  the  prime  ( ' )  denotes  a  column  vector .  Murphy  showed  that 
the  Brier  score  can  be  partitioned  into  three  additive  parts. 
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To  do  so,  sort  the  N  response  vectors  into  T  subcollections 
such  that  all  the  response  vectors,  rfc,  in  subcollection  t  are 
identical.  Let  nfc  be  the  number  of  responses  in  the  t'th 
subcollection,  and  let  ct  be  the  proportion-correct  vector  for 
the  t'th  subcollection: 


nt 

E 

t=l 


Cjt/nt 


ckt* '  where 


Let  c  be  the  proportion-correct  vector  across  all  responses, 


N 


cj  "t  £ 


i-1  ^ 


c^) ,  where 


Finally,  let  u  be  the  unity  vector,  a  row  vector  whose  k  elements 
are  all  one. 

Then  Murphy's  partition  of  the  Brier  score  is: 


B  =  c (u-c) ' 


♦i 


T 

E 

t=l 


nt(Et-ct)(rt-ct)‘  - 


5  V£t-S>  «t-S>’ 


The  first  term  is  not  a  function  of  the  probability 
assessments;  rather,  it  reflects  the  relative  frequency  of  true 
events  across  the  k  alternatives.  For  example,  suppose  all  the 


items  being  assessed  had  the  same  two  alternatives,  {rain,  no  rain}. 
Then  the  first  term  of  the  partition  is  a  function  of  the  base  rate 
of  rain  across  the  N  items  (or  days) .  If  it  always  (or  never) 
rained,  this  term  would  be  zero.  Its  maximum  value,  (k-l)/k, 
would  indicate  maximum  uncertainty  about  the  occurrence  of  rain. 

The  second  term  is  a  measure  of  calibration,  the  weighted  average 
of  the  squared  difference  between  the  responses  in  a  category  and 
the  proportion  correct  for  that  category.  The  third  term,  called 
"resolution,"  reflects  the  assessor's  ability  to  sort  the  events 
into  subcategories  for  which  the  proportion  correct  is  different 
from  the  overall  proportion  correct. 

Murphy's  partition  was  designed  for  repeated  predictions  of 
the  same  set  of  events  (e.g. ,  rain  vs.  no  rain).  When  the 
alternatives  have  no  common  meaning  across  items  (e.g.,  in  a 
multiple-choice  examination,  then  all  that  the  first  term  indicates 
is  the  extent  to  which  the  correct  answers  appear  equally  often 
as  the  first,  second,  etc.,  alternative. 

When  only  one  response  per  item  is  scored,  Murphy's 
partition  (Murphy,  1972)  reduces  to: 


B'  -  c(l-c)  +  N  ^  nt(rfc-ct)  -  N  nt(ct-c) 


where  c  is  the  overall  proportion  correct,  and  cfc  is  the 
proportion  correct  in  the  t'th  subcategory.  When  the  scored 
response  is  the  response  >  .5  (as  with  the  two- a Iter native , 
half-range  task),  the  first  term  reflects  the  subject's  ability 
to  pick  the  correct  alternative,  and  thus  might  be  called 
"knowledge."  As  before,  the  second  term  measures  calibration, 
and  the  third  resolution. 


Similar  measures  of  calibration  have  been  proposed  by 
Adams  and  Adams  (1961)  and  by  Oskamp  (1962) .  None  of  these 
measures  of  calibration  discriminates  overconfidence  from 
underconfidence.  The  sampling  properties  of  these  measures  are 
not  known. 

Meteorological  Research 

In  1906,  W.  Ernest  Cooke,  Government  Astronomer  for 
Western  Australia,  advocated  that  each  meteorological  prediction 
be  accompanied  by  a  single  number  that  would  "indicate, 
approximately,  the  weight  or  degree  of  probability  which  the 
forecaster  himself  attaches  to  that  particular  prediction." 

He  reported  (Cooke,  1906a,  b)  results  from  1,951  predictions. 

Of  those  to  which  he  had  assigned  the  highest  degree  of 
probability  ("almost  certain  to  be  verified"),  .985  were  correct. 
For  his  middle  degree  of  probability  ("normal  probability"), 

.938  were  correct,  while  for  his  lowest  degree  of  probability 
("doubtful"),  .787  were  correct. 

In  1951,  Williams  asked  eight  professional  Weather  Bureau 
forecasters  in  Salt  Lake  City  to  assess  the  probability  of 
precipitation  for  each  of  1,095  12-hour  forecasts,  using  one  of 
the  numbers  0,  .2,  .4,  .6,  .8,  or  1.0.  Throughout  most  of  the 
range,  the  proportion  of  precipitation  days  was  lower  than  the 
probability  assigned.  This  might  reflect  a  fairly  natural  form 
of  hedging  in  public  announcements.  People  are  much  more  likely 
to  criticize  a  weather  forecast  that  leads  them  to  be  without 
an  umbrella  when  it  rains  than  one  that  leads  them  to  carry  an 
umbrella  on  dry  days. 


7 


Similar  results  emerged  from  a  study  by  Murphy  and  Winkler 
(1974) .  Their  forecasters  assessed  the  probability  of 
precipitation  for  the  next  day  twice,  before  and  after  seeing 
output  from  a  computerized  weather  prediction  system  (PEATMOS) . 

The  7,188  assessments  (before  and  after  PEATMOS)  showed  the  same 
overestimation  of  the  probability  of  rain  found  by  Williams. 

Sanders  (1958)  collected  12,635  predictions,  using  the 
eleven  responses  0,  .1,  ...  ,  .9,  1.0,  for  a  variety  of 
dichotomized  events:  wind  direction,  wind  speed,  gusts, 
temperatures,  cloud  amount,  ceiling,  visibility,  precipitation 
occurrence,  precipitation  type,  and  thunderstorm.  These  data 
revealed  only  a  slight  tendency  for  the  forecasters'  probability 
assessments  to  exceed  the  proportion  of  weather  events  that 
occurred.'1'  Root  (1962)  reported  a  symmetric  pattern  of 
calibration  of  4,138  precipitation  forecasts:  assessed 
probabilities  were  too  low  in  the  low  range  and  .too  high  in  the 
high  range,  relative  to  the  observed  frequencies. 

Winkler  and  Murphy  (1968a)  reported  calibration  curves  for 
an  entire  year  of  precipitation  forecasts  from  Hartford, 
Connecticut.  Each  forecast  was  for  either  a  six-hour  or  a 
twelve-hour  time  period,  with  a  lead  time  varying  from  5  to  44 
hours.  Unfortunately,  it  was  unclear  whether  the  forecasters 
had  included  "a  trace  of  precipitation"  (less  than  .01  inch)  in 
their  predictions.  The  data  were  analyzed  twice,  once  assuming 
that  "precipitation"  included  the  occurrence  of  traces  and 
again  without  traces.  The  inclusion  or  exclusion  of  traces  had 
a  substantial  effect  on  calibration,  as  did  the  time  period. 
Six-hour  forecasts  with  traces  included  and  twelve-hour  forecasts 
excluding  traces  exhibited  excellent  calibration.  The  calibration 
curve  for  twelve-hour  forecasts  with  traces  lay  above  the 
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identity  line;  the  curve  for  six-hour  forecasts  excluding  traces 
lay  well  below  it.  Variations  in  lead  time  did  not  affect 
calibration. 

National  Weather  Service  forecasters  have  been  expressing 

their  forecasts  of  precipitation  occurrence  in  probabilistic 

terms  since  1965.  The  calibration  for  some  parts  of  this 

massive  data  base  has  been  published  (Murphy  &  Winkler,  1977a; 

U.S.  Weather  Bureau,  1969) .  Over  the  years  the  calibration  has 

improved.  Figure  1  shows  the  calibration  for  24,859  precipitation 

forecasts  made  in  Chicago  during  the  four  years  ending  June  1976. 

This  shows  remarkably  good  calibration;  Murphy  says  the  data  for 

recent  years  are  even  better!  He  attributes  this  superior 

performance  to  the  experience  with  probability  assessment  that 

the  forecasters  have  gained  over  the  years  and  to  the  fact  that 

2 

these  data  were  gathered  from  real  on-the-job  performance. 

Early  Laboratory  Research 

In  1957,  Adams  reported  the  calibration  of  subjects  who  used 
an  eleven-point  confidence  scale;  "[the  subject  was]  instructed 
to  express  his  confidence  in  terms  of  the  percentage  of  responses, 
made  at  that  particular  level  of  confidence,  that  he  expects  to 
be  correct  ....  Of  those  responses  made  with  confidence  £, 
about  p%  should  be  correct"  (pp.  432-433) . 

In  Adams'  task,  each  of  four  words  were  presented 
tachistoscopically  ten  times  successively,  with  increasing 
illumination  each  time,  to  ten  subjects.  After  each  exposure 
subjects  wrote  down  the  word  they  thought  they  saw  and  gave  a 
confidence  judgment.  The  resulting  calibration  curve  showed 
that  proportions  correct  greatly  exceeded  the  confidence  ratings 
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Figure  1.  Calibration  data  for  precipitation 
forecasts.  The  number  of  forecasts  is  shown 
for  each  point.  Source:  Murphy  &  Winkler, 
1971a. 
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along  the  entire  response  scale  (except  for  the  responses  of 
100).  Great  caution  must  be  taken  in  interpreting  these  data: 
because  each  word  was  shown  10  times ,  the  responses  are  highly 
interdependent.  It  is  unknown  what  effect  such  interdependence 
has  on  calibration.  Subjects  may  have  chosen  to  "hold  back"  on 
early  presentations,  unwilling  to  give  a  high  response  when 
they  knew  that  the  same  word  would  be  presented  several  more 
times . 

The  following  year,  Adams  and  Adams  (1958)  reported  a 
training  experiment,  using  the  same  response  scale,  but  a  new, 
three-alternative,  single-response  task:  For  each  of  156  pairs 
of  words  per  session,  subjects  were  asked  whether  the  words 
were  antonyms,  synonyms,  or  unrelated.  The  mean  calibration 
scores  (based  on  the  absolute  difference,  |rt-c  |)  of  14 
experimental  subjects ,• who  were  shown  calibration  tallies  and 
calibration  curves  after  each  of  five  sessions,  decreased  by  48% 
from  the  first  session  to  the  last.  Six  control  subjects,  whose 
only  feedback  was  a  tally  of  their  unscored  responses,  showed  a 
36%  mean  increase  in  discrepancy  scores. 

Adams  and  Adams  (1961)  discussed  many  aspects  of  calibration 
(using  the  term  "realism  of  confidence") ,  anticipating  much  of 
the  work  done  by  others  in  recent  years ,  and  presented  more  bits 
of  data,  including  the  grossly  overconfident  calibration  curve 
of  a  schizophrenic  who  believed  he  was  Jesus  Christ.  In  a 
nonsense-syllable  learning  task,  they  found  large  overconfidence 
on  the  first  trial  and  improvement  after  16  trials.  They  also 
briefly  described  a  transfer  of  training  experiment:  On  Day  1, 
subjects  made  108  decisions  about  the  percentage  of  blue  dots 
in  an  array  of  blue  and  red  dots.  On  Days  2  and  4,  the  subjects 
decided  on  the  truth  or  falsity  of  250  general  knowledge 
statements.  On  Day  3,  they  lifted  weights,  blindfolded.  On 
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Day  5,  they  made  256  decisions  (synonym,  antonym,  or  unrelated) 
about  pairs  of  words.  Eight  experimental  subjects,  given 
calibration  feedback  after  each  of  the  first  four  days,  showed 
on  the  fifth  day  a  mean  absolute  discrepancy  score  significantly 
lower  than  that  of  eight  control  (no  feedback)  subjects, 
suggesting  some  transfer  of  training.  Finally,  Adams  and  Adams 
reported  that  across  56  subjects  taking  a  multiple-choice  final 
examination  in  elementary  psychology,  poorer  calibration  was 
associated  with  greater  fear  of  failure  (r  =  .36).  Neither 
knowledge  nor  overconfidence  was  related  to  fear  of  failure. 

3 

Oskamp  (1962)  presented  subjects  with  200  MMPI  profiles  as 
stimuli.  Half  the  profiles  were  from  men  admitted  to  a  VA 
hospital  for  psychiatric  reasons;  the  others  were  from  men 
admitted  for  purely  medical  reasons.  The  subjects'  task  was  to 
decide,  for  each  profile,  whether  the  patient's  status  was 
psychiatric  or  medical  and  to  state  the  probability  that  their 
decision  was  correct.  Each  profile  had  been  independently 
categorized  as  hard  (61  profiles) ,  medium  (88)  ,  or  easy  (51)  on 
the  basis  of  an  actuarially-derived  classification  system, 
which  correctly  identified  57%,  69%  and  92%  of  the  hard,  medium, 
and  easy  profiles,  respectively. 

All  200  profiles  were  judged  by  three  groups  of  subjects: 

28  undergraduate  psychology  majors,  23  clinical  psychology 
trainees  working  at  a  VA  hospital,  and  21  experienced  clinical 
psychologists.  The  28  inexperienced  judges  were  later  split 
into  two  matched  groups  and  given  the  same  200  profiles  again. 
Half  were  trained  during  this  second  round  to  improve  accuracy; 
the  rest  were  trained  to  improve  calibration. 
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Oskamp  used  three  measures  of  subjects'  performance: 
accuracy  (percent  correct) ,  confidence  (mean  probability 
response) ,  and  appropriateness  of  confidence  (a  calibration 
score)  : 


1 
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All  three  groups  tended  to  be  overconfident,  especially  the 
undergraduates  in  their  first  session  (accuracy  70%,  confidence 
.78).  However,  all  three  groups  were  under confident  on  the 
easy  profiles  (accuracy  87%,  confidence  .83). 


The  subjects  trained  for  accuracy  increased  their  accuracy 
from  67%  to  73%,  approaching  their  confidence  level,  .78,  which 

4 

did  not  change  as  a  result  of  training.  The  subjects  trained 
for  calibration  lowered  their  confidence  from  .78  to  .74, 
bringing  it  closer  to  their  accuracy,  68%,  which  remained 
unchanged.  As  would  be  expected  from  these  changes,  the 
calibration  score  of  both  groups  improved. 

Signal  Detection  Research 

In  the  early  days  of  signal  detection  research, 
investigators  looked  into  the  possibility  of  using  confidence 
ratings  rather  than  Yes-No  responses  in  order  to  reduce  the 
amounts  of  data  required  to  determine  stable  receiver  operating 
characteristic  (ROC)  curves.  Swets,  Tanner,  and  Birdsall  (1961) 
asked  four  observers  to  indicate  their  confidence  that  they  had 
heard  signal  plus  noise  rather  than  noise  alone  for  each  of  1200 
trials.  Although  three  of  the  four  subjects  were  terribly 
calibrated,  the  four  calibration  curves  were  widely  different. 
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One  subject  exhibited  a  severe  tendency  to  assign  too  small 
probabilitities  (e.g.,  the  signal  was  present  over  70%  of  the 
times  when  that  subject  used  the  response  category  ".05-. 19"). 

Clarke  (1960)  presented  one  of  five  different  words,  mixed 
with  noise,  to  listeners  through  headphones.  The  listeners 
selected  the  word  they  thought  they  heard,  and  then  rated  their 
confidence  by  indicating  one  of  five  categories  defined  by 
slicing  the  probability  scale  into  five  ranges.  After  each  of 
12  practice  tests  of  75  items,  listeners  scored  their  own  results 
and  noted  the  percentage  of  correct  identifications  in  each 
rating  category,  thus  allowing  them  to  change  strategies  on  the 
next  test.  Clarke  found  that  although  all  five  listeners 
appeared  well  calibrated  when  data  were  averaged  over  the  five 
stimulus  words,  analyses  for  individual  words  showed  that  the 
listeners  tended  to  be  overconfident  for  low-intelligibility. 

Pollack  and  Decker  (1958)  used  a  verbally  defined  6-point 
confidence  rating  scale  that  ranged  from  "Positive  I  received  the 
message  correctly"  to  "Positive  I  received  the  message 
incorrectly."  With  this  rating  scale  it  is  impossible  to 
determine  whether  an  individual  is  well  calibrated,  but  it  is 
possible  to  see  shifts  in  calibration  across  conditions. 
Calibration  curves  for  easy  words  generally  lay  above  those 
for  difficult  words,  whatever  the  signal-to-noise  ratio,  and 
the  curves  for  high  signal-to-noise  ratios  lay  above  those  for 
low  signal-to-noise  ratios,  whatever  the  word  difficulty. 

In  most  of  these  studies,  calibration  was  of  secondary 
interest;  the  important  question  was  whether  confidence  ratings 
would  yield  the  same  ROC  curves  as  Yes-No  procedures.  By  1966, 
Green  and  Swets  concluded  that,  in  general,  ratings  scales  and 
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Yes-No  procedures  yield  almost  identical  ROC  curves.  Since  then, 
studies  of  calibration  have  disappeared  from  the  signal  detection 
literature. 

Recent  Laboratory  Research 

Overconfidence .  The  most  pervasive  finding  in  recent  research 
is  that  people  are  overconfident  with  general-knowledge  items  of 
moderate  or  extreme  difficulty.  Some  typical  results  showing 
overconfidence  are  presented  in  Figure  2.  Hazard  and  Peterson 
(1973)  asked  40  armed  forces  personnel  studying  at  the  Defense 
Intelligence  School  to  respond  with  probabilities  or  with  odds 
to  50  two-alternative  general-knowledge  items  (e.g..  Which  maga¬ 
zine  had  the  largest  circulation  in  1970,  Playboy  or  Time?) . 
Lichtenstein  (unpublished)  found  similar  results,  using  the  same 
items,  but  only  the  probability  response,  with  19  Oregon  Research 
Institute  employees,  as  did  Phillips  and  Wright  (1977)  with  dif¬ 
ferent  items,  using  British  undergraduate  students  as  subjects. 

Numerous  other  studies  using  general-knowledge  questions  have 
shown  the  same  overconfidence  (Nickerson  &  McGoldrick,  1965; 
Fischhoff,  Slovic  &  Lichtenstein,  1977;  Lichtenstein  &  Fischhoff, 
1977,  1980a,  1980b;  Koriat,  Lichtenstein  &  Fischhoff,  1980).  Cam¬ 
bridge  and  Shreckengost  (1978)  found  overconfidence  with  Central 
Intelligence  Agency  analysts.  Fischhoff  and  Slovic  (1980)  found 
severe  overconfidence  using  a  variety  of  impossible  or  nearly  im¬ 
possible  tasks  (e.g. ,  predicting  the  winners  in  6-furlong  horse 
races,  diagnosing  the  malignancy  of  ulcers).  Pitz  (1974)  repor¬ 
ted  overconfidence  using  a  full-range  method. 

Fischhoff,  Slovic  and  Lichtenstein  (1977)  focused  on -the  appro¬ 
priateness  of  expressions  of  certainty.  Using  a  variety  of  methods 
(no  alternatives,  one  alternative,  and  two  alternatives  with  half 
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Figure  2.  Calibration  for  half-range,  general  knowledge  items 
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range  and  full  range) ,  they  found  that  only  72  and  83  percent  of 
the  items  to  which  responses  of  1.0  were  given  were  correct.  In 
the  full-range  tasks,  items  assigned  the  other  extreme  response, 
zero,  were  correct  20  to  30  percent  of  the  time.  Using  an  odds 
response  did  not  correct  the  overconfidence.  Answers  assigned 
odds  of  1,000:1  of  being  correct  were  only  81  to  88  percent  cor¬ 
rect;  for  odds  of  one  million  to  one,  the  correct  alternative  was 
chosen  only  90  to  96  percent  of  the  time.  Subjects  showed  no  re¬ 
luctance  to  use  extreme  odds;  in  one  of  the  experiments  almost  one 
fourth  of  the  responses  were  1,000:1  or  greater.  Further  analyses 
showed  that  extreme  overconfidence  was  not  confined  to  just  a  few 
subjects  or  a  few  items. 

The  effect  of  difficulty.  Overconfidence  is  most  extreme  with 
tasks  of  great  difficulty  (Clarke,  1960;  Nickerson  &  McGoldrick, 
1965;  Pitz,  1974).  With  essentially  impossible  tasks  (discrim¬ 
inating  European  from  American  handwriting,  Asian  from  European 
children's  drawings,  and  rising  ftom  falling  stock  prices)  calibra 
tion  curves  did  not  rise  at  all;  for  all  assessed  probabilities, 
the  proportion  of  correct  alternatives  chosen  was  close  to  .5 
(Lichtenstein  &  Fischhoff,  1977) .  Subjects  were  not  reluctant 
to  use  high  probabilities  in  these  tasks;  70  to  80  percent  of  all 
responses  were  greater  than  .5. 

As  tasks  get  easier,  overconfidence  is  reduced.  Lichtenstein 
and  Fischhoff  (1977)  allowed  one  group  of  subjects  in  the  hand¬ 
writing  discrimination  task  to  study  a  correctly- labeled  set  of 
sample  stimuli  before  making  their  probability  assessments.  This 
experience  made  the  task  much  easier  (71%  correct  versus  51%  for 
the  no-study  group) ,  and  the  study  group  was  only  slightly  over¬ 
confident.  Lichtenstein  and  Fischhoff  (1977)  performed  post  hoc 
analyses  of  the  effect  of  difficulty  on  calibration  using  two 
large  collections  of  data  from  general-knowledge,  two-alternative 


half-range  tasks.  They  separated  easy  items  (those  for  which  most 
subjects  chose  the  correct  alternative)  from  hard  items  and  know¬ 
ledgeable  subjects  (those  who  selected  the  most  correct  alterna¬ 
tives)  from  less  knowledgeable  subjects.  They  found  a  systematic 
decrease  in  overconfidence  as  percent  correct  increased.  Indeed, 
the  most  knowledgeable  subjects  responding  to  the  easiest  items 
were  underconfident  (e.g.,  90%  correct  when  responding  with  a 
probability  of  .80).  This  finding  was  replicated  with  two  new 
groups  of  subjects  given  sets  of  items  chosen  to  be  hard  or  easy 
on  the  basis  of  previous  subjects'  performance.  The  resulting 
calibration  curves  are  shown  in  Figure  3,  along  with  the  corres¬ 
ponding  calibration  curves  from  the  post  hoc  analyses. 

In  the  research  just  cited,  difficulty  was  defined  on  the  basis 
of  subjects'  performance  (Clarke,  1960;  Lichtenstein  &  Fischhoff, 
1977) .  More  recently,  Lichtenstein  and  Fischhoff  (1980a) ,  fol¬ 
lowing  a  lead  of  Oskamp  (1962) ,  developed  a  set  of  500  two-alter¬ 
native  general  knowledge  items  for  which  difficulty  could  be  de¬ 
fined  independently.  The  items  were  of  three  types:  which  of  two 
cities,  states,  countries,  or  continents  is  more  populous  (e.g.. 

Las  Vegas  vs.  Miami) ,  which  of  two  cities  is  farther  in  distance 
from  a  third  city  (e.g..  Is  Melbourne  farther  from  Rome  or  from 
Tokyo?),  and  which  historical  event  happened  first  (e.g.,  Magna 
Carta  signed  vs.  Mohammed  born).  Thus,  each  item  had  associated 
with  it  two  numbers  (populations,  distances,  or  elapsed  time  to 
the  present) .  The  ratio  of  the  larger  to  the  smaller  of  those 
numbers  was  taken  as  a  measure  of  difficulty:  the  250  items  with 
the  largest  ratios  were  designated  as  easy;  the  remaining,  as 
hard.  This  a  priori  classification  was  quite  successful;  over  35 
subjects,  the  percent  correct  was  81  for  easy  items  and  58  for 
hard  items.  These  results,  too,  showed  overconfidence  for  hard 
items  and  underconfidence  for  easy  items. 


The  hard/easy  effect  seems  to  arise  from  assessors'  inability 
to  appreciate  how  difficult  or  easy  a  task  is.  Phillips  and  Choo 
(unpublished)  found  no  correlation  across  subjects  between  per¬ 
centage  correct  and  the  subjects'  ratings  on  an  11-point  scale  of 
the  difficulty  of  a  set  of  just-completed  items.  However,  subjects 
do  give  different  distributions  of  responses  for  different  tasks; 
Lichtenstein  and  Fischhoff  (1977)  reported  a  correlation  of  .91 
between  percentage  correct  and  mean  response  across  16  different 
sets  of  data.  But  the  differences  in  response  distributions  are 
less  than  they  should  be:  over  those  same  16  sets  of  data,  the 
proportion  correct  varied  from  .43  to  .92  while  the  mean  response 
varied  only  from  .65  to  .86. 

Ferrell  and  McGoey  (1980)  have  recently  developed  a  model  for 
the  calibration  of  discrete  probability  assessments  that  addresses 
the  hard/easy  effect.  The  model,  based  on  signal  detection  theory, 
assumes  that  assessors  transform  their  feelings  of  subjective  un¬ 
certainty  into  a  decision  variable,  X,  which  is  partitioned  into 
sections  with  cutoff  values  {x^.  The  assessor  reports  probability 
ri  whenever  X  lies  between  x^_^  and  x^.  Ferrell  and  McGoey  assume 
that,  in  the  absence  of  feedback  about  calibration  performance, 
the  assessor  will  not  change  the  set  of  cutoff  values,  (xi),  as 
task  difficulty  changes.  This  assumption  leads  to  a  prediction  of 
overconfidence  with  hard  items  and  underconfidence  with  easy  items. 
Application  of  the  model  to  much  of  the  data  from  Lichtenstein  and 
Fischhoff  (1977)  showed  a  moderately  good  fit  to  both  the  calibra¬ 
tion  curves  and  the  distribution  of  responses  under  the  assumption 
that  the  cutoff  values  remained  constant  as  difficulty  changed. 
Thus,  the  hard/easy  effect  is  seen  as  an  inability  to  change  the 
cutoffs  involved  in  the  transformation  from  feelings  of  certainty 
to  probabilistic  responses. 

Effect  of  base  rates.  One  alternative  (true/false)  tasks  may 
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be  characterized  by  the  proportion  of  true  statements  in  the  set 
of  items.  To  be  well  calibrated  on  a  particular  set  of  items  one 
must  take  this  base-rate  information  into  account.  The  signal- 
detection  model  of  Ferrell  and  McGoey  (1980)  assumes  that  calibra¬ 
tion  is  affected  independently  by  (a)  the  proportion  of  true 
statements  and  (b)  the  assessor's  ability  to  discriminate  true 
from  false  statements.  Assuming  that  the  cutoff  values, {x^},  are 
held  constant,  the  model  predicts  quite  different  effects  on  cali¬ 
bration  from  changing  the  proportion  of  true  statements  (while 
holding  discriminability  constant)  as  opposed  to  changing  dis- 
criminability  (while  holding  the  proportion  of  true  statements 
constant) .  Ferrell  and  McGoey  presented  data  supporting  their 
model.  Students  in  three  engineering  courses  assessed  the  proba¬ 
bility  that  the  answers  they  wrote  for  their  examinations  would 
be  judged  correct  by  the  grader.  Post  hoc  analyses  separating 
the  subjects  into  four  groups  (high  vs.  low  percentage  of  correct 
answers  and  high  vs.  low  discriminability)  revealed  the  calibra¬ 
tion  differences  predicted  by  the  model.  Unpublished  data  collec¬ 
ted  by  Fischhoff  and  Lichtenstein,  shown  in  Figure  4,  also  suggest 
support  for  the  model.  Four  groups  of  subjects  received  25  one- 
alternative  general-knowledge  items  (e.g.,  "The  Aeneid  was  written 
by  Homer")  differing  in  the  proportion  of  true  statements:  .08, 

.20,  .50,  and  .71.  The  groups  showed  dramatically  different  cali¬ 
bration  curves,  of  roughly  the  same  shape  as  predicted  by  Ferrell 
and  McGoey  for  their  base-rate  changing,  discriminability  constant 
case. 

Individual  differences.  Unqualified  statements  that  one  per¬ 
son  is  better  calibrated  than  another  person  are  difficult  to  make, 
for  two  reasons.  First,  at  least  several  hundred  responses  are 
needed  in  order  to  get  a  stable  measure  of  calibration.  Second, 
it  appears  that  calibration  strongly  depends  on  the  task,  particu- 


of  true  statements.  Source:  Fischhoff  &  Lichtenstein,  unpublished. 


larly  on  the  difficulty  of  the  task.  Indeed,  Lichtenstein  and 
Fischhoff  (1980a)  have  suggested  that  each  person  may  have  an 
"ideal"  test  (i.e.,  a  test  whose  difficulty  level  leads  to  neither 
over-  nor  underconfidence,  and  thus  the  test  on  which  the  person 
will  be  best  calibrated) .  However,  the  difficulty  level  of  the 
"ideal"  test  may  vary  across  people.  Thus,  even  when  one  person 
is  better  than  another  on  a  particular  set  of  items,  the  reverse 
may  be  true  for  a  harder  or  easier  set. 

Comparisons  between  different  groups  of  subjects  have  gen¬ 
erally  shown  few  differences  when  difficulty  was  controlled. 
Graduate  students  in  psychology,  who  presumably  are  more  intelli¬ 
gent  than  the  usual  subjects  (those  who  answered  an  ad  in  the  col¬ 
lege  newspaper) ,  were  no  different  in  calibration  (Lichtenstein  & 
Fischhoff,  1977) .  Nor  have  we  found  differences  in  calibration  or 
overconfidence  between  males  and  females  (unpublished  data,  Lich¬ 
tenstein  &  Fischhoff) . 

Wright,  Phillips,  Whalley,  Choo,  Ng,  Tan,  and  Wisudha  (1978) 
have  studied  cross-cultural  differences  in  calibration.  The  cali¬ 
bration  of  their  British  sample  was  shown  in  Figure  2.  Their  other 
samples  were  Hong  Kong,  Indonesian,  and  Malay  students.  The  Asian 
groups  showed  essentially  flat  calibration  curves.  The  authors 
speculated  that  fate-oriented  Asian  philosophies  might  account 
for  these  differences. 

Corrective  efforts.  Fischhoff  and  Slovic  (1980)  tried  to  ward 
off  overconfidence  on  the  task  of  discriminating  Asian  from  Euro¬ 
pean  children's  drawings  by  using  explicitly  discouraging  instruc¬ 
tions  : 

All  drawings  were  taken  from  the  Child  Art  Collection  of 
Dr.  Rhoda  Kellogg,  a  leading  proponent  of  the  theory  that 
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children  from  different  countries  and  cultures  make  very 
similar  drawings  .  .  .  Remember,  it  may  well  be  impossible 
to  make  this  sort  of  discrimination.  Try  to  do  the  best 
you  can.  But  if,  in  the  extreme,  you  feel  totally  uncertain 
about  the  origin  of  all  of  these  drawings ,  do  not  hesitate 
to  respond  with  .5  for  every  one  of  them  (p.  792). 

These  instructions  lowered  the  mean  response  by  about  .05,  but  sub¬ 
stantial  overconfidence  was  still  found. 

Will  increased  motivation  improve  calibration?  Sieber  (1974) 
compared  the  calibration  of  two  groups  of  students  on  a  course- 
related  set  of  four-alternative  items.  One  group  was  told  that 
they  were  taking  their  mid-term  examination.  The  other  group  was 
told  that  the  test  was  not  the  mid-term,  but  would  be  used  to 
coach  them  for  the  mid-term.  The  two  groups  did  not  differ  in  the 
number  of  correct  alternatives  chosen,  but  the  presumably  more 
motivated  group,  whose  performance  would  determine  their  grade, 
showed  significantly  worse  calibration  (greater  overconfidence) . 

Training  assessors  by  giving  them  feedback  about  their  cali¬ 
bration  has  shown  mixed  results.  As  mentioned,  Adams  and  Adams 
(1958)  found  modest  improvement  in  calibration  after  five  training 
sessions  and,  in  a  later  study  (1961) ,  some  generalization  of 
training.  Choo  (1976) ,  using  only  one  training  session  with  75  two 
alternative  general-knowledge  items ,  found  little  improvement  and 
no  generalization. 

Lichtenstein  and  Fischhoff  (1980b)  trained  two  groups  of  sub¬ 
jects  by  giving  extensive,  personalized  calibration  feedback  after 
each  of  either  2  or  10  sessions  composed  of  200  two- alternative 
general-knowledge  items .  They  found  appreciable  improvement  in 
calibration,  all  of  which  occurred  between  the  first  and  the  second 


session.  Modest  generalization  occurred  for  tasks  with  different 
difficulty  levels,  content,  and  response  mode  (four  rather  than 
two  alternatives) ,  but  no  improvement  was  found  with  a  fractile 
assessment  task  (described  in  the  next  section)  or  on  the  discrim¬ 
ination  of  European  from  American  handwriting  samples. 

Another  approach  to  improving  calibration  is  to  restructure  the 
task  in  a  way  that  discourages  overconfidence.  In  a  study  by  Kor- 
iat,  Lichtenstein,  and  Fischhoff  (1980) ,  subjects  first  responded 
to  30  two-alternative  general-knowledge  items  in  the  usual  way. 

They  then  received  10  additional  items.  For  each  item  they  wrote 
down  all  the  reasons  they  could  think  of  that  supported  or  contra¬ 
dicted  either  of  the  two  possible  answers,  and  then  made  the  usual 
choice  and  probability  assessments.  This  procedure  significantly 
improved  their  calibration.  An  additional  study  helped  to  pinpoint 
the  effective  ingredient  of  this  technique.  After  responding  as 
usual  to  an  initial  set  of  30  items,  subjects  were  given  30  more 
items.  For  each,  they  first  chose  a  preferred  answer,  then  wrote 
(a)  one  reason  supporting  their  chosen  answer,  (b)  one  reason  con¬ 
tradicting.  Then  they  assessed  the  probability  that  their  chosen 
answer  was  correct.  Only  the  group  asked  to  write  contradicting 
reasons  showed  improved  calibration.  This  result,  as  well  as  corre¬ 
lational  analyses  on  the  data  from  the  first  study,  suggests  that 
an  effective  partial  remedy  for  overconfidence  is  to  search  for 
reasons  why  one  might  be  wrong . 

Expertise.  Students  taking  a  college  course  are,  presumably, 
experts,  at  least  temporarily,  in  the  topic  material  of  the  course. 
Sieber  (1974)  reported  excellent  calibration  for  students  taking  a 
practice  raid-term  examination  (i.e.,  the  group  of  students  who  were 
told  that  the  test  was  not  their  mid-term) .  Over  98  percent  of  their 
1.0  responses  and  only  .5  of  their  0.0  responses  were  correct. 

Pitz  (1974)  asked  his  students  to  predict  their  grade  for  his  course; 


they  also  were  well  calibrated. 

Would  these  subjects  have  been  as  well  calibrated  on  items  of 
equivalent  difficulty  that  were  not  in  their  area  of  expertise? 
Lichtenstein  and  Pischhoff  (1977)  asked  graduate  students  in  psycho¬ 
logy  to  respond  to  50  two- alternative  general-knowledge  items  and 
50  items  covering  knowledge  of  psychology  (e.g.,  the  Ishihara  test 
is  (a)  a  perceptual  test,  (b)  a  social  anxiety  test) .  The  two  sub¬ 
tests  were  of  equal  difficulty  and  the  calibration  was  similar 
for  the  two  tasks. 

Christensen-Szalanski  and  Bushyhead  (in  press)  reported  nine 
physicians'  assessments  of  the  probability  of  pneumonia  for  1,531 
patients  who  were  examined  because  of  a  cough.  Their  calibration 
was  abysmal;  the  curve  rose  so  slowly  that  for  the  highest  confi¬ 
dence  level  (approximately  .88),  the  proportion  of  patients  actu¬ 
ally  having  pneumonia  was  less  than  .20.  Similar  results  have 
been  reported  for  diagnoses  of  skull  fracture  and  pneumonia  by 
Lusted  (1977)  and  for  diagnoses  of  skull  fracture  by  DeSmet,  Fry- 
back,  and  Thornbury  (1979) .  The  results  of  these  field  studies 
with  physicians  were  in  marked  contrast  with  the  superb  calibration 
of  weather  forecasters'  precipitation  predictions.  We  suspect 
that  several  factors  favor  the  weather  forecasters.  First,  they 
have  been  making  probabilistic  forecasts  for  years.  Second,  the 
task  is  repetitive;  the  hypothesis  (Will  it  rain?)  is  always  the 
same.  In  contrast,  a  practicing  physician  is  hour  by  hour  consi¬ 
dering  a  wide  array  of  hypotheses  (Is  it  a  skull  fracture?  Does 
she  have  strep?  Does  he  need  further  hospitalization?) .  Finally, 
the  outcome  feedback  for  weather  forecasters  is  well  defined  and 
promptly  received.  This  is  not  always  true  for  physicians;  patients 
fail  to  return  or  are  referred  elsewhere,  or  diagnoses  remain  uncer¬ 
tain. 
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People  who  bet  on  or  establish  the  odds  for  horse  races  might 
also  be  considered  experts.  Under  the  parimutuel  (or  totaliator) 
method,  the  final  odds  are  determined  by  the  amount  of  money  bet 
on  each  horse,  allowing  a  kind  of  group  calibration  curve  to  be 
computed.  Such  curves  (Fabricand,  1965;  Hoerl  &  Fallin,  1974) 
show  excellent  calibration,  with  only  a  slight  tendency  for  people 
to  bet  too  heavily  on  the  long  shots.  However,  such  data  are  only 
inferentially  related  to  probability  assessment.  More  relevant 
are  the  calibration  results  reported  by  Dowie  (1976) ,  who  studied 
the  forecast  prices  printed  daily  by  a  sporting  newspaper  in  Bri¬ 
tain.'  These  predictions,  in  the  form  of  odds,  are  made  by  one 
person  for  all  the  horses  in  a  given  race;  about  eight  people  made 
the  forecasts  during  the  year  studied.  The  calibration  of  the 
forecasts  for  29,307  horses  showed  a  modest  underconfidence  for 
probabilities  greater  than  . 4  and  superb  calibration  for  probabi¬ 
lities  less  than  .4  (which  comprised  98%  of  the  data). 

The  burgeoning  research  on  calibration  has  led  to  the  develop¬ 
ment  of  a  new  kind  of  expertise:  calibration  experts,  who  know 
about  the  common  errors  people  make  in  assessing  probabilities. 
Lichtenstein  and  Fischhoff  (1980a)  compared  the  calibration  of 
eight  such  experts  with  12  naive  subjects  and  15  subjects  who  had 
previously  been  trained  to  be  well  calibrated.  The  normative  ex¬ 
perts  not  only  overcame  the  overconfidence  typically  shown  by 
naive  subjects,  but  apparently  overcompensated,  for  they  were 
underconf ident.  The  experts  were  also  slightly  more  sensitive  to 
item  difficulty  than  the  other  two  groups. 

Future  events .  Wright  and  Wisudha  (1979)  have  speculated 
that  calibration  for  future  events  may  be  different  than  for  gener¬ 
al-knowledge  questions.  If  true,  this  would  limit  extrapolation 
from  research  with  general-knowledge  questions  to  the  prediction 


of  future  events.  Unfortunately,  Wright  and  Wisudha'a  general- 
knowledge,  items  were  more  difficult  than  their  future  events, 
which  could  account  for  the  superior  calibration  of  the  latter. 

Fischhoff  and  Beyth  (1975)  asked  150  Israeli  students  to 
assess  the  probability  of  15  then-future  events,  possible  outcomes 
of  President  Nixon's  much-publicized  trips  to  China  and  Russia 
(e.g..  President  Nixon  will  meet  Mao  at  least  once).  The  result¬ 
ing  calibration  curve  was  quite  close  to  the  identity  line.  How¬ 
ever,  Fischhoff  and  Lichtenstein  (unpublished)  have  recently  found 
that  the  calibration  of  future  events  showed  the  same  severe  over- 
confidence  as  was  shown  for  general-knowledge  items  of  comparable 
difficulty.  Phillips  and  Choo  (unpublished)  obtained  calibration 
curves  for  three  sets  of  items:  general  knowledge,  future  events, 
and  past  events  (e.g. ,  a  jumbo  jet  crashed  killing  more  than  100 
people,  some  time  in  the  past  30  days) .  For  both  British  and  Chi¬ 
nese  subjects,  all  three  curves  showed  overconfidence.  Calibration 
for  future  and  past  events  was  identical,  and  somewhat  better  than 
for  the  general-knowledge  items.  The  difficulty  levels  of  the 
three  sets  of  items  could  not  account  for  these  results. 

Jack  Dowie  and  colleagues  are  now  collecting  calibration  data 
from  several  hundred  students  in  the  Open  University's  course  on 
risk,  using  course- related  questions,  general-knowledge  questions, 
and  future  event  questions.  The  students  received  a  general  intro¬ 
duction  to  the  concept  of  calibration  and  were  given  feedback 
about  their  performance  and  calibration.  Preliminary  results^ 
suggest  that  they  were  moderately  overconfident.  Calibration  was 
best  on  general-knowledge  items  and  worst  on  course-related  items, 
but  the  significance  and  origins  of  these  differences  remain  to  be 
investigated. 


Continuous  Propositions:  Uncertain  Quantities 


The  Fractile  Method 

Uncertainty  about  the  value  of  an  uncertain  continuous 
quantity  (e.g.,  what  proportion  of  students  prefer  Scotch  to 
Bourbon?  What  is  the  shortest  distance  from  England  to  Austral¬ 
ia?)  may  be  expressed  as  a  probability  density  function  across 
the  possible  values  of  that  quantity.  However,  assessors  are  not 
usually  asked  to  draw  the  entire  function.  Instead,  the  elicita¬ 
tion  procedure  most  commonly  used  is  some  variation  of  the  frac¬ 
tile  method.  In  this  method,  the  assessor  states  values  of  the 
uncertain  quantity  that  are  associated  with  a  small  number  of 
predetermined  fractiles  of  the  distribution.  For  the  median  or 
.50  fractile,  for  example,  the  assessor  states  a  value  of  the 
quantity  such  that  the  true  value  is  equally  likely  to  fall  above 
or  below  the  stated  value;  the  .01  fractile  is  a  value  such  that 
there  is  only  1  chance  in  100  that  the  true  value  is  smaller  than 
the  stated  value.  Usually  3  or  5  fractiles,  including  the  median, 
are  assessed.  In  a  variant  called  the  tertile  method,  the  assess¬ 
or  states  two  values  (the  .33  and  .67  fractiles)  such  that  the 
entire  range  is  divided  into  three  equally  likely  sections. 

Two  calibration  measures  are  commonly  reported.  The  inter¬ 
quartile  index  is  the  percentage  of  items  for  which  the  true  value 
falls  inside  the  interquartile  range  (i.e.,  between  the  .25  and 
the  .75  fractiles) .  The  perfectly  calibrated  person  will,  in  the 
long  run,  have  an  interquartile  index  of  50.  The  surprise  index 
is  the  percentage  of  true  values  that  fall  outside  the  most  ex¬ 
treme  fractiles  assessed.  When  the  most  extreme  fractiles  assessed 
are  .01  and  .99,  the  perfectly  calibrated  person  will  have  a  sur¬ 
prise  index  of  2.  A  large  surprise  index  shows  that  the  assessor's 
confidence  bounds  have  been  too  narrow  to  encompass  enough  of  the 
true  values  and  thus  indicates  overconfidence  (or  hyperprecision; 
Pitz,  1974) .  Underconf idence  would  be  indicated  by  an  interquar- 


tile  index  greater  than  50  and  a  low  surprise  index;  no  such  data 
have  been  reported  in  the  literature. 

The  impetus  for  investigating  the  calibration  of  probability 
density  functions  came  from  a  1969  paper  by  Alpert  and  Raiffa  (see 
Chapter  21) .  Alpert  and  Raiffa  worked  with  Harvard  Business 
School  students,  all  familiar  with  decision  analysis.  In  Group  1, 
all  subjects  assessed  five  fractiles,  three  of  which  were  .25, 

.50,  and  .75.  The  extreme  fractiles  were,  however,  different 
for  four  subgroups;  .01  and  .99  (Group  A);  .001  and  .999  (Group 
B) ;  "the  minimum  possible  value"  and  "the  maximum  possible  value" 
(Group  C) ;  and  "astonishingly  low"  and  "astonishingly  high" 

(Group  D) .  The  interquartile  and  surprise  indices  for  these  four 
subgroups  are  shown  in  Table  1.  Discouraged  by  the  enormous  num¬ 
ber  of  surprises,  Alpert  and  Raiffa  then  ran  three  additional 
groups  (2,  3,  and  4)  who,  after  assessing  10  uncertain  quantities, 
received  feedback  in  the  form  of  an  extended  report  and  explana¬ 
tion  of  the  results,  along  with  perorations  to  "Spread  Those  Ex¬ 
treme  Fractiles!".  The  subjects  then  responded  to  10  new  uncer¬ 
tain  quantities.  Results  before  and  after  feedback  are  shown  in 
Table  1.  The  subjects  improved,  but  still  showed  considerable 
overconfidence . 

Hession  and  McCarthy  (1974)  collected  data  comparable  to 
Alpert  and  Raiffa' s  first  experiment,  using  55  uncertain  quantities 
and  36  graduate  students  as  subjects.  Their  instructions  urged 
subjects  to  make  certain  that  the  interval  between  the  .25  frac- 
tile  and  the  .75  fractile  did  indeed  capture  half  of  the  probabi¬ 
lity.  "Later  discussion  with  individual  subjects  made  it  clear 
that  this  consistency  check  resulted  in  most  cases  in  a  readjust¬ 
ment,  decreasing  the  interquartile  range  originally  assessed" 

(p.  7) — thus  making  matters  worse!  This  instructional  emphasis, 
not  used  by  Alpert  and  Raiffa,  may  explain  why  Hession  and  McCar- 
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Table  1 


Calibration  S  usinary  for  Continuous  leans:  Psrcanc  of  Trus 
Valuss  Falling  Within  Interquartile  Kangs 
and  Oueaids  ths  Extreae  Fractlles 


Interquartile 

Indftx* 

Sure rise 

Index 

Alpert  &  Ralffa  (1969) 

H* 

Observed 

Observed 

Ideal 

Group  1-A  (.01,  .99) 

880  > 

i 

46 

2 

Group  1-B  (.001,  .999) 

500 

>  33 

40 

.2 

Group  1-C  ("nin"  &  "nax") 

700 

47 

7 

Group  1— D  ("astonishingly  high/law") 

700  J 

f 

38 

7 

Groups  2,  3,  &  4  Before  Training 

2270 

34 

34 

2 

After  Training 

2270 

44 

19 

2 

Hessian  &  McCarthy  (1974) 

2035 

25 

47  . 

2 

Selvidge  (1975) 

Five  Fractlles 

400 

56 

10 

2 

Seven  Fractlles  (incl.  .1  &  .9) 

520 

50 

7 

2 

Moskowltz  &  Bullers  (1978) 

Proportions 

Three  Fractlles 

120 

- 

27 

2 

Five  Fractlles 

145 

32 

42 

2 

Dow- Jones 

Three  Fractlles 

210 

- 

38 

2 

Five  Fractlles 

210 

20 

64 

2 

Plckhardt  &  Wallace  (1974) 

Group  1,  First  Round 

39 

32 

2 

Fifth  Hound 

? 

49 

20 

2 

Group  2,  First  Round 

7 

30 

46 

2 

Sixth  Round 

7 

45 

24 

2 

Brown  (1973) 

414 

29 

42 

2 

Lichtenstein  &  Fischhoff  (1980b) 

Pretest 

924 

32 

41 

2 

Postteat 

924 

37 

40 

2 

Seaver,  von  Winterfeldt,  &  Edwards  (1978) 

Fractlles 

160 

42 

34 

2 

Odds-Frac  tiles 

160  , 

53 

24 

2 

Probabilities 

180 

57 

5 

2 

Odds 

180 

47 

5 

2 

Log  Odds 

Schaefer  &  Borcherdlng  (1973) 

140 

31 

20 

2 

First  Day,  Fractlles 

396 

23 

39 

2 

Fourth  Day,  Fractlles 

396 

38 

12 

2 

First  Day,  Hypothetical  Sanple 

396 

16 

50 

2 

Fourth  Day,  Hypothetical  Sanple 

396 

48 

6 

2 

Larson  &  Raenan  (1979) 

"Reasonably  Certain" 

450 

- 

42 

? 

Pratt  (Personal  Coosunicatlon) 

"Astonishingly  high/low" 

175 

37 

5 

7 

Murphy  &  Winkler  (1974) 

Extreaes  were  .125  &  .875 

132 

45 

27 

25 

Murphy  &  Winkler  (1977b) 

Extreaes  ware  .125  &  .875 

432 

54 

21 

25 

Stsel  von  Holstein  (1971) 

1269 

27 

30 

2 

*  N  is  ths  total  nuabar  of  assessed  distributions. 

b  The  ideal  percent  of  events  falling  within  ths  Interquartile  range  is  50,  for  all 
experlaents  except  Brown  (1973).  He  elicited  the  .30  and  .70  fractlles,  so  ths 
ideal  is  401. 
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thy's  subjects  were  so  badly  calibrated,  as  shown  in  Table  1. 

Hession  and  McCarthy  also  gave  their  subjects  a  number  of 
individual  difference  measures:  Authoritarianism,  Dogmatism, 
Rigidity,  Pettigrew's  Category-Width  Scale,  and  Intelligence. 

The  correlations  of  the  subjects'  test  scores  with  their  inter¬ 
quartile  and  surprise  indices  were  mostly  quite  low,  although 
the  Authoritarian  scale  correlated  -.31  with  the  interquartile 
score  and  +  .47  with  the  surprise  score  (N  =  28).  This  is  con¬ 
sistent  with  Wright  and  Phillips'  (1976)  finding  that  Authoritar¬ 
ianism  was  modestly  related  to  calibration. 

Selvidge  (1975)  extended  Alpert  and  Raiffa's  work  by  first 
asking  subjects  four  questions  about  themselves  (e.g.,  Do  you 
prefer  Scotch  or  Bourbon?) .  Their  responses  determined  the  true 
answer  for  these  group-generated  proportions  (e.g,  what  proportion 
of  the  subjects  answering  the  questionnaire  preferred  Scotch  to 
Bourbon?).  One  group  gave  five  fractiles,  .01,  .25,  .5,  .75,  and 
.99.  Another  group  gave  those  five  plus  two  others:  .1  and  .9. 

As  shown  in  Table  1,  the  seven-fractile  group  did  a  bit  better. 

The  five-fractile  results  are  not  as  different  from  Alpert  and 
Raiffa's  results  as  they  appear.  Three  of  Alpert  and  Raiffa's 
uncertain  quantities  were  group-generated  proportions  similar  to 
Selvidge's  items.  On  these  three  items,  Alpert  and  Raiffa  found 
57%  in  the  interquartile  range  and  20%  surprises.  Finally,  for 
one  of  the  items,  half  the  subjects  in  the  five-fractile  group 
were  asked  to  give  .25,  .5,  and  .75  first,  and  then  to  give  .01 
and  .99,  while  the  other  half  were  instructed  to  assess  the  ex¬ 
tremes  first.  Selvidge  found  fewer  surprises  for  the  former  order 
(8%)  than  for  the  latter  (16%)  . 

Moskowitz  and  Bullers  (1978)  also  used  group-generated  pro¬ 
portions,  but  found  many  more  surprises  than  did  Selvidge.  One 
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group  gave  the  same  five  fractiles  that  Selvidge  used  (in  the 
order  .5,  .25,  .75,  .01,  .99).  Another  group  was  asked  for  only 
three  assessments  (the  mode  of  the  distribution  and  the  .01  and 
.99  fractiles).  Before  making  their  assessments,  the  three-frac- 
tile  group  received  a  presentation  and  discussion  of  some  typical 
reference  events  (e.g.,  "Consider  a  lottery  in  which  100  people 
are  participating.  Your  chance  of  holding  the  winning  ticket  is 
1  in  100")  designed  to  give  assessors  a  better  understanding  of 
the  meaning  of  a  .01  probability.  As  shown  in  Table  1,  the  three- 
fractile  group  had  fewer  surprises  than  the  f ive-fractile  group. 
In  another  experiment  using  the  same  two  methods,  Moskowitz  and 
Bullers  asked  44  undergraduate  commerce  students  to  assess  the 
average  value  of  the  Dow- Jones  Industrial  Index  for  1977,  1974, 
1965,  1960,  and  1950.  Each  subject  gave  assessments  before  and 
after  engaging  in  three-person  discussions.  Since  no  systematic 
differences  were  found  due  to  the  discussions,  the  data  have  been 
combined  in  Table  1.  Again,  the  three-fractile  group  (who  had 
received  the  presentation  on  the  meaning  of  .01)  had  fewer  sur¬ 
prises  than  the  f ive-fractile  group.  The  performance  of  the  f ive- 
fractile  group  was  extremely  bad. 

Pickhardt  and  Wallace  (1974) replicated  Alpert  and  Raiffa's 
work,  with  variations.  Across  several  groups  they  reported  38  to 
48  percent  surprises  before  feedback,  and  not  less  than  30%  sur¬ 
prises  after  feedback.  Two  variations,  using  or  not  using  course 
grade  credit  as  a  reward  for  good  calibration,  and  using  or  not 
using  scoring  rule  feedback,  made  no  difference  in  the  number  of 
surprises.  Pickhardt  and  Wallace  also  studied  the  effects  of 
extended  training:  Two  groups  of  18  and  30  subjects  (number  of 
uncertain  quantities  not  reported)  responded  for  five  and  six  ses¬ 
sions  with  calibration  feedback  after  every  session.  Modest  im¬ 
provement  was  found,  as  shown  in  Table  1. 
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Finally,  Pickhardt  and  Wallace  studied  the  effects  of  in¬ 
creasing  knowledge  on  calibration  in  the  context  of  a  production 
simulation  game  called  PROSIM.  Thirty-two  graduate  students  each 
made  51  assessments  during  a  simulated  17  "days"  of  production 
scheduling.  Each  assessment  concerned  an  event  that  would  occur 
1,  2,  or  3  "days"  hence.  The  closer  the  time  of  assessment  to  the 
time  of  the  event,  the  more  the  subject  knew  about  the  event. 
Overconfidence  decreased  with  this  increased  information:  there 
were  32%  surprises  with  3-day  lags,  24%  with  2-day  lags,  and  7% 
with  1-day  lags.  No  improvement  was  observed  over  the  17  "days" 
of  the  simulation. 

Brown  (1973)  asked  31  subjects  to  assess  seven  fractiles  (.01, 
.10,  .30,  .50,  .70,  .90,  .99)  for  14  uncertain  quantities.  The 
results,  shown  in  Table  1,  are  particularly  discouraging,  because 
each  question  was  accompanied  by  extensive  historical  data  (e.g. , 
for  "Where  will  the  Consumer  Price  Index  stand  in  December,  1970?," 
subjects  were  given  the  Consumer  Price  Index  for  every  quarter 
between  March,  1962,  and  June,  1970) .  For  11  of  the  questions, 
had  the  subjects  given  the  historical  minimum  as  their  .01  frac- 
tile  and  the  historical  maximum  as  their  .99  fractile,  they  would 
have  had  no  surprises  at  all.  The  other  three  questions  showed 
strictly  increasing  or  strictly  decreasing  histories,  and  the  true 
value  was  close  to  any  simple  approximation  of  the  historical 
trend.  The  subjects  must  have  been  relying  heavily  on  their  own 
erroneous  knowledge  to  have  given  distributions  so  tight  as  to 
produce  42%  surprises. 

Lichtenstein  and  Fischhoff  (1980b)  elicited  five  fractiles 
(.01,  .25,  .5,  .75,  .99)  from  12  subjects  on  77 

uncertain  quantities  both  before  and  after  the  subjects  received 
extensive  calibration  training  on  two-alternative  discrete  items. 
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As  shown  in  Table  1,  the  subjects  did  not  significantly  im¬ 
prove  their  calibration  of  uncertain  quantities. 

Other  Methods 


Seaver,  von  Winterfeldt,  and  Edwards  (1978)  studied  the 
effects  of  five  different  response  modes  on  calibration.  Two 
groups  used  the  fractile  method,  either  five  fractiles  (.01,  .25, 
.50,  .75,  .99)  or  the  odds  equivalents  of  those  fractiles  (1:99, 
1:3,  1:1,  3:1,  99:1).  Three  other  groups  responded  with  probabi¬ 
lities,  odds,  or  odds  on  a  log-odds  scale  to  one- a Iter native 
questions  that  specified  a  particular  value  of  the  uncertain  quan 
tity  (e.g..  What  is  the  probability  that  the  population  of  Canada 
in  1973  exceeded  25  million?) .  Five  such  fixed  values  were  given 
for  each  uncertain  quantity,  and  from  the  responses  the  experi¬ 
menters  estimated  the  interquartile  and  surprise  indices.  For 
each  method,  seven  to  nine  students  responded  to  20  uncertain 
quantities.  As  shown  in  Table  1,  the  groups  giving  probabilistic 
and  odds  responses  had  distinctly  better  surprise  indices  than 
those  using  the  fractile  method.  It  is  unclear  whether  this  su¬ 
periority  is  due  to  the  information  communicated  by  the  values 
chosen  by  the  experimenter.  The  log-odds  response  mode  did  not 
work  out  well. 

Schaefer  and  Borcherding  (1973)  asked  22  students  to  assess 
18  group-generated  proportions  in  each  of  four  sessions.  Each 
subject  used  two  assessment  techniques:  (1)  the  fractile  method 
(.01,  .125,  .25,  .5,  .75,  .875,  .99),  and  (2)  the  hypothetical 
sample  method.  In  the  latter  method,  the  assessor  states  the 
size,  n,  and  the  number  of  successes,  r,  of  a  hypothetical  sample 
that  best  reflects  the  assessor's  knowledge  about  the  uncertain 
quantity  (i.e.,  I  feel  as  certain  about  the  true  value  of  the 
proportion  as  I  would  feel  were  I  to  observe  a  sample  of  n  cases 
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with  r  successes) .  Larger  values  of  n  reflect  greater  certainty 
about  the  true  value  of  the  proportion.  The  ratio  r/n  reflects 
the  mean  of  the  probability  density  function.  Subjects  had  great 
difficulty  with  this  method,  despite  instructions  that  included 
examples  of  the  beta  distributions  underlying  this  method.  After 
every  session,  subjects  were  given  extensive  feedback,  with  em¬ 
phasis  on  their  own  and  the  group's  calibration.  The  results 
from  the  first  and  last  sessions  are  shown  in  Table  1.  Improve¬ 
ment  was  found  for  both  methods.  Results  from  the  hypothetical 
sample  method  started  out  worse  (50%  surprises  and  only  16%  in 
the  interquartile  range) ,  but  ended  up  better  (6%  surprises  and 
48%  in  the  interquartile  range)  than  the  fractile  method. 

Barclay  and  Peterson  (1973)  compared  the  tertile  method  (i.e. 
the  fractiles  .33  and  .67)  with  a  "point"  method  in  which  the 
assessor  is  asked  to  give  the  modal  value  of  the  uncertain  quan¬ 
tity,  and  then  two  values,  one  above  and  one  below  the  mode,  each 
of  which  is  half  as  likely  to  occur  as  is  the  modal  value  (i.e., 
points  for  which  the  probability  density  function  is  half  as 
high  as  at  the  mode) .  Using  10  almanac  questions  as  uncertain 
quantities  and  70  students  at  the  Defense  Intelligence  School  in 
a  wi thin-subject  design,  they  found  for  the  tertile  method  that 
29%  (rather  than  33%)  of  the  true  answers  fell  in  the  central 
interval.  For  the  point  method,  only  39%  fell  between  the  two 
half-probable  points,  whereas,  for  most  distributions,  approxi¬ 
mately  75%  of  the  density  falls  between  these  points. 

Pitz  (1974)  reported  several  results  using  the  tertile  method 
For  19  subjects  estimating  the  populations  of  23  countries,  he 
found  only  16%  of  the  true  values  falling  inside  the  central  third 
of  the  distributions.  In  another  experiment  he  varied  the  items 
according  to  the  depth  and  richness  of  knowledge  he  presumed  his 
subjects  to  have.  With  populations  of  countries  (low  knowledge) 
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he  found  23%  of  the  true  values  in  the  central  third;  with 
heights  of  well-known  buildings  (middling  knowledge),  27%;  and 
with  ages  of  famous  people  (high  knowledge),  47%,  the  last  being 
well  above  the  expected  33%.  In  another  study,  he  asked  six  sub¬ 
jects  to  assess  tertiles,  and  a  few  days  later  to  choose  among 
bets  based  on  their  own  tertile  values.  He  found  a  strong  pref¬ 
erence  for  bets  involving  the  central  region,  just  the  reverse  of 
what  their  too- tight  intervals  should  lead  them  to. 

Larson  and  Reenan's  (1979)  subjects  first  gave  their  best 
guess  at  the  true  answer  (i.e.,  the  mode),  and  then  two  more  va¬ 
lues  that  defined  an  interval  within  which  they  were  "reasonably 
certain"  the  correct  answer  lay.  Forty-two  percent  of  the  true 
values  lay  outside  this  region.  Note  how  similar  this  surprise 
index  is  to  the  indices  of  Alpert  and  Raiffa's  subjects  given 
the  verbal  phrases  "minimum/maximum"  947%)  and  "astonishingly 
high/low"  (38%) . 

Real  Tasks  with  Experts 

Pratt6  asked  a  single  expert  to  predict  movie  attendance  for 
175  movies  or  double  features  shown  in  two  local  theaters  over  a 
period  of  more  than  one  year.  The  expert  assessed  the  median, 
quartiles,  and  "astonishingly  high"  and  "astonishingly  low" 
values.  As  shown  in  Table  1,  the  interquartile  range  tended  to 
be  too  small.  Even  though  the  expert  received  outcome  feedback 
throughout  the  experiment,  the  only  evidence  of  improvement  in 
calibration  over  time  came  in  the  first  few  days. 

Three  experiments  used  weather  forecasters  for  subjects.  In 
two  experiments,  Murphy  and  Winkler  (1974,  1977b),  asked  weather 
forecasters  to  give  five  fractiles  (.125,  .25,  .5,  .75,  .875)  for 
tomorrow's  high  temperature.  The  results,  shown  in  Table  1,  indi- 
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cate  excellent  calibration.  These  subjects  had  fewer  surprises 
in  the  extreme  25%  of  the  distribution  than  did  most  of  Alpert 
and  Raiffa's  subjects  in  the  extreme  2%!  Murphy  and  Winkler 
found  that  the  five  subjects  in  the  two  experiments  who  used  the 
fractile  method  were  better  calibrated  than  four  other  subjects 
who  used  a  fixed-width  method.  For  the  fixed-width  method,  the 
forecasters  first  assessed  the  median  temperature  (i.e.,  the  high 
temperature  for  which  they  believed  there  was  a  . 5  probability 
that  it  would  be  exceeded) .  Then  they  stated  the  probability 
that  the  temperature  would  fall  within  intervals  of  5°F  and  of 
9°F  centered  at  the  median.  These  forecasters  were  overconfident; 
the  probability  associated  with  the  temperature  falling  inside 
the  interval  tended  to  be  too  large.  The  superiority  of  the  frac¬ 
tile  method  over  the  fixed-width  method  stands  in  contrast  with 
Seaver,  von  Winterfeldt,  and  Edwards'  finding  that  fixed-value 
methods  were  superior,  perhaps  because  the  fixed  intervals  used 
by  Murphy  and  Winkler  (5°F  and  9°F)  were  noninf ormative . 

Stael  von  Holstein  (1971)  used  three  fixed-value  tasks; 

(1)  Average  temperature  tomorrow  and  the  next  day  (dividing  the 
entire  response  range  into  8  categories) ,  (2)  average  temperature 
four  and  five  days  from  now  (8  categories) ,  and  (3)  total  amount 
of  rain  in  the  next  five  days  (4  categories) .  From  each  set  of 
responses  (4  or  8  probabilities  summing  to  1.0),  he  estimated 
the  underlying  cumulative  density  function.  He  then  combined 
the  1,269  functions  given  by  28  participants.  From  the  group 
cumulative  density  function  shown  in  his  paper,  we  have  estimated 
the  surprise  and  interquartile  indices  (see  Table  1) .  In  contrast 
to  other  weather  forecasters,  these  subjects  were  quite  poorly 
calibrated,  perhaps  because  the  tasks  were  less  familiar. 

Summary  of  Calibration  with  Uncertain  Quantities 


The  overwhelming  evidence  from  research  using  fractiles  to 


assess  uncertain  quantities  is  that  people's  probability  distri¬ 
butions  tend  to  be  too  tight.  The  assessment  of  extreme  fractiles 
is  particularly  prone  to  bias.  Training  improves  calibration 
somewhat.  Experts  sometimes  perform  well  (Murphy  &  Winkler,  1974, 
1977b),  sometimes  not  (Pratt,6  Stael  von  Holstein,  1971).  There 
is  some  evidence  that  difficulty  is  related  to  calibration  for 
continuous  propositions.  Pitz  (1974)  and  Larson  and  Reenan  (1979) 
found  such  an  effect,  and  Pickhardt  and  Wallace's  (1974)  finding 
that  one-day  lags  led  to  fewer  surprises  than  three-day  lags  in 
their  stimulation  game  is  relevant  here.  Several  studies  (e.g., 
Barclay  &  Peterson,  1973:  Murphy  &  Winkler,  1974)  have  reported 
a  correlation  between  the  spread  of  the  assessed  distribution  and 
the  absolute  difference  between  the  assessed  median  and  the  true 
answer,  indicating  that  subjects  do  have  a  partial  sensitivity 
to  how  much  they  do  or  don't  know.  This  finding  parallels  the 
correlation  between  percent  correct  and  mean  response  with  dis¬ 
crete  propositions . 


Discussion 


Why  Be  Well  Calibrated? 


Why  should  a  probability  assessor  worry  about  being  well 
calibrated?  Von  Winter feldt  and  Edwards  (1973)  have  shown  that 
in  most  real-world  decision  problems  with  continuous  decision 
options  (e.g.,  invest  $X) ,  fairly  large  assessment  errors  make 
relatively  little  difference  in  the  expected  gain.  However, 
several  considerations  argue  against  this  reassuring  view.  First, 
in  a  two-alternative  situation,  the  payoff  function  can  be  quite 
steep  in  the  crucial  region.  Suppose  your  doctor  must  decide 
the  probability  that  you  have  condition  A,  and  should  receive 
treatment  A,  versus  having  condition  B  and  receiving  treatment  B. 


Suppose  that  the  utilities  are  such  that  treatment  A  is  better 
if  the  probability  that  you  have  condition  A  is  i  .4;  otherwise 
treatment  B  is  better.  If  the  doctor  assesses  the  probability 
that  you  have  A  as  p(A)  =  .45,  but  is  poorly  calibrated,  so  that 
the  appropriate  probability  is  .25,  then  the  doctor  would  use 
treatment  A  rather  than  treatment  B  and  you  would  lose  quite  a 
chunk  of  expected  utility.  Real-life  utility  functions  of  just 
this  type  are  shown  by  Fryback  (1974). 

Furthermore,  when  the  payoffs  are  very  large,  when  the  er¬ 
rors  are  very  large,  or  when  such  errors  compound,  the  expected 
loss  looms  large.  For  instance,  in  the  Reactor  Safety  Study 
(U.S.  NRC ,  1975)  "at  each  level  of  the  analysis  a  log-normal 
distribution  of  failure  rate  data  was  assumed  with  5  and  95  per¬ 
centile  limits  defined"  (Weatherwax,  1975,  p.  31).  The  research 
reviewed  here  suggests  that  distributions  built  from  assessments 
of  the  .05  and  .95  fractiles  may  be  grossly  biased.  If  such 
assessments  are  made  at  several  levels  of  an  analysis ,  with  each 
assessed  distribution  being  too  narrow,  the  errors  will  not  can¬ 
cel  each  other,  but  will  compound.  And  because  the  costs  of 
nuclear  power  plant  fa ' lure  are  large ,  the  expected  loss  from 
such  errors  could  be  enormous. 

If  good  calibration  is  important,  how  can  it  be  achieved? 

Cox  (1958)  recommended  that  one  externally  recalibrate  people's 
assessments  by  fitting  a  model  to  a  set  of  assessments  for  items 
with  known  answers,  From  then  on,  the  model  is  used  to  correct 
or  adjust  responses  given  by  the  assessor.  The  technical  diffi¬ 
culties  confronting  external  recalibration  are  substantial.  When 
eliciting  the  assessments  to  be  modeled,  one  would  have  to  be 
careful  not  to  give  the  assessors  any  more  feedback  than  they 
normally  receive,  for  fear  of  their  changing  their  calibration 


as  it  is  being  measured.  As  Savage  (1971)  pointed  out,  "you 
might  discover  with  experience  that  your  expert  is  optimistic  or 
pessimistic  in  some  respect  and  therefore  temper  his  judgments. 
Should  he  suspect  you  of  this,  however,  you  and  he  may  well  be 
on  the  escalator  to  perdition"  (p.  796)  .  Furthermore,  since  re¬ 
search  has  shown  that  the  type  of  miscalibration  observed  depends 
on  a  task's  difficulty  level,  one  would  also  have  to  believe  that 
the  future  will  match  the  difficulty  of  the  events  used  for  the 
recalibration. 

The  theoretical  objections  to  external  recalibration  may  be 
even  more  serious  than  the  practical  objections.  The  numbers  pro¬ 
duced  by  a  recalibration  process  will  not,  in  general,  follow  the 
axioms  of  probability  theory  (e.g.,  the  numbers  associated  with 
mutually  exclusive  and  exhaustive  events  will  not  always  sum  to 
one,  nor  will  it  be  generally  true  that  P (A)  •  P(B)  =  P(A,B)  for 
independent  events) ;  hence,  these  new  numbers  cannot  be  called 
probabilities . 

A  more  fruitful  approach  would  be  to  train  assessors  to 
become  well  calibrated.  Under  what  conditions  might  one  expect 
that  assessors  could  achieve  this  goal? 

One  should  not  expect  assessors  to  be  well  calibrated  when 
the  explicit  or  implicit  rewards  for  their  assessments  do  not 
motivate  them  to  be  honest  in  their  assessments.  As  an  extreme 
example,  an  assessor  who  is  threatened  with  beheading  should  any 
event  occur  whose  probability  was  assessed  at  <.25  will  have  good 
reason  not  to  be  well  calibrated  with  assessments  of  .20.  Although 
this  example  seems  absurd,  more  subtle  pressures  such  as  "avoid 
being  made  to  look  the  fool"  or  "impress  your  boss"  might  also 
provide  strong  incentives  for  bad  calibration.  Any  rewards  for 
either  wishful  thinking  or  denial  could  also  bias  the  assessments. 


Receiving  outcome  feedback  after  every  assessment  is  the 
best  condition  for  successful  training.  Dawid  (in  press)  has 
shown  that  under  such  conditions  assessors  who  are  honest  and 
coherent  subjectivists  will  expect  to  be  well  calibrated  regard¬ 
less  of  the  interdependence  among  the  items  being  assessed.  In 
contrast,  Kadane7  has  shown  that,  in  the  absence  of  trial-by¬ 
trial  outcome  feedback,  honest,  coherent  subjectivists  will  ex¬ 
pect  to  be  well  calibrated  if  and  only  if  all  the  items  being 
assessed  are  independent.  This  theorem  puts  strong  restrictions 
on  the  situations  under  which  it  would  be  reasonable  to  expect 
assessors  to  learn  to  be  well  calibrated.  Even  if  the  training 
process  could  be  conducted  using  only  events  that  assessors  be¬ 
lieved  were  independent,  there  may  be  good  reason  to  doubt  the 
independence  of  the  real-life  tasks  to  which  the  assessors  would 
apply  their  training.  Important  future  events  may  be  interdepen¬ 
dent  either  because  they  are  influenced  by  a  common  underlying 
cause  or  because  the  assessor  evaluates  all  of  them  by  drawing  on 
a  common  store  of  knowledge.  In  such  circumstances ,  one  would 
not  want  or  expect  to  be  well  calibrated. 

The  possibility  that  people's  biases  vary  as  a  function  of 
the  difficulty  of  the  tasks  poses  a  further  obstacle  to  calibra¬ 
tion  training  in  the  absence  of  immediate  outcome  feedback.  The 
difficulty  level  of  future  tasks  may  be  impossible  to  predict, 
thus  rendering  the  training  ineffective. 

Calibration  as  Cognitive  Psychology 

Experiments  on  calibration  can  be  used  to  learn  how  people 
think.  Even  if  the  immediate  practical  significance  of  each  study 
is  limited,  it  may  still  provide  greater  understanding  of  how 
people  develop  and  express  feelings  of  uncertainty  and  certainty. 
However,  a  striking  aspect  of  much  of  the  literature  reviewed  here 
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is  its  "dust-bowl  empiricism."  Psychological  theory  is  often 
absent,  either  as  motivation  for  the  research  or  as  explanation 
of  the  results. 


Not  all  authors  have  avoided  theorizing.  Slovic  (1972) 
and  tfversky  and  Kahneman  (1974)  argued  that,  as  a  result  of  lim¬ 
ited  information-processing  abilities,  people  adopt  simplifying 
rules  or  heuristics.  Although  generally  quite  useful,  these 
heuristics  can  lead  to  severe  and  systematic  errors.  For  example 
the  tendency  of  people  to  give  unduly  tight  distributions  when 
assessing  uncertain  quantities  could  reflect  the  heuristic  called 
"anchoring  and  adjustment."  When  asked  about  an  uncertain  quan¬ 
tity,  one  naturally  thinks  first  of  a  point  estimate  such  as  the 
median.  This  value  then  serves  as  an  anchor.  To  give  the  25th 
or  75th  percentile,  one  adjusts  downward  or  upward  from  the  an¬ 
chor.  But  the  anchor  has  such  a  dominating  influence  that  the 
adjustment  is  insufficient;  hence  the  fractiles  are  too  close 
together,  yielding  overconfidence. 

Pitz  (1974),  too,  accepted  that  people's  inf ormation- proces¬ 
sing  capacity  and  working  memory  capacity  are  limited.  He  sugges¬ 
ted  that  people  tackle  complex  problems  serially,  working  through 
a  portion  at  a  time.  To  reduce  cognitive  strain,  people  ignore 
the  uncertainty  in  their  solutions  to  the  early  portions  of  the 
problem  in  order  to  reduce  the  complexity  of  the  calculations 
in  later  portions.  This  could  lead  to  too-tight  distributions 
and  overconfidence.  Pitz  also  suggested  that  one  way  people 
estimate  their  own  uncertainty  is  by  seeing  how  many  different 
ways  they  can  arrive  at  an  answer,  that  is,  how  many  different 
serial  solutions  they  can  construct.  If  many  are  found,  people 
will  recognize  their  own  uncertainty;  if  few  are  found,  they  will 
not.  The  richer  the  knowledge  base  from  which  to  build  alterna- 
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tive  structures,  the  less  the  tendency  toward  overconfidence. 


Phillips  and  Wright  (1977)  presented  a  three-stage  serial 
model.  Their  model  distinguishes  people  who  tend  naturally  to 
think  about  uncertainty  in  a  probabilistic  way  from  those  who 
respond  in  a  more  black-and-white  fashion.  They  work  on  cultural 
and  individual  differences  (Wright  &  Phillips,  1976;  Wright  et  al., 
1978)  has  attempted,  with  partial  success,  to  identify  distinct 
cognitive  styles  in  processing  this  type  of  information. 

Koriat  et  al.  (1980)  also  took  an  information-processing 
approach.  They  discussed  three  stages  for  assessing  probabili¬ 
ties.  First,  one  searches  one's  memory  for  relevant  evidence. 
Next,  one  assesses  that  evidence  to  arrive  at  a  feeling  of  cer¬ 
tainty  or  doubt.  Finally,  one  translates  the  certainty  feeling 
into  a  number.  The  manipulations  used  by  Koriat  et  al.  were 
designed  to  alter  the  first  two  stages,  by  forcing  people  to 
search  for  and  attend  to  contradictory  evidence,  thereby  lowering 
their  confidence. 

Ferrell  and  McGoey's  (1980)  model,  on  the  other  hand,  deals 
entirely  with  the  third  stage,  translation  of  feelings  of  cer¬ 
tainty  into  numerical  responses.  By  assuming  that,  without  feed¬ 
back,  people  are  unable  to  alter  their  translation  strategies  as 
either  the  difficulty  of  the  items  or  the  base  rate  of  the  events 
changes,  the  model  provides  strong  predictions  which  have  received 
support  from  calibration  data. 

Structure  and  process  theories  of  probability  assessment  are 
beginning  to  emerge;  we  hope  that  the  further  development  of  such 
theories  will  serve  to  integrate  this  rather  specialized  field 
into  the  broader  field  of  cognitive  psychology. 
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Footnotes 


The  writing  of  this  paper,  and  our  research  reported  herein, 
were  supported  by  contracts  from  the  Advanced  Research  Projects 
Agency  of  the  Department  of  Defense  (Contracts  N00014-73-C-0438 
and  N00014-76-C-0074)  and  the  Office  of  Naval  Research  (Contract 
N00014-80-C-0150) . 

We  are  grateful  to  P.  Slovic,  L.  R.  Goldberg,  A.  Tversky, 

R.  Schaefer,  D.  Kahneman,  and  most  especially  K.  Borcherding  for 
their  helpful  suggestions. 

1.  The  references  by  Cooke  (1906) ,  Williams  (1951) ,  and 
Sanders  (1958)  were  brought  to  our  attention  through  an  unpub¬ 
lished  manuscript  by  Howard  Raiffa,  dated  January  1969,  entitled 
"Assessments  of  Probabilities." 

2.  Personal  communication,  August,  1980. 

3.  The  MMPI  (Minnesota  Multiphasic  Personality  Inventory) 
is  a  personality  inventory  widely  used  for  psychiatric  diagnosis. 
A  profile  is  a  graph  of  13  subscores  from  the  inventory. 

4.  MMPI  buffs  might  note  that  with  this  minimal  training 
the  undergraduates  showed  as  high  an  accuracy  as  either  the  best 
experts  or  the  best  actuarial  prediction  systems. 

5.  J.  Dowie,  personal  communication,  November,  1980. 

6.  J.  W.  Pratt,  personal  communication,  October,  1975. 

7.  J.  Kadane,  personal  communication,  November,  1980. 
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